Method for improving speech quality in speech transmission tasks

ABSTRACT

A method for calculating the amplification factor, which co-determines the volume, for a speech signal transmitted in encoded form includes dividing the speech signal into short temporal signal segments. The individual signal segments are encoded and transmitted separately from each other, and the amplification factor for each signal segment is calculated, transmitted and used by the decoder to reconstruct the signal. The amplification factor is determined by minimizing the value E(g_opt2)=(1−a)*f₁(g_opt2)+a*f₂(g_opt2), the weighting factor a being determined taking into account both the periodicity and the stationarity of the encoded speech signal.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a U.S. National Stage Application under 35 U.S.C. § of PCT International Application No. PCT/EP01/02603, filed Mar. 8, 2001, which claims priority to German Patent Application No. DE 100 20 863.0, filed Apr. 28, 2000. Each of these applications is incorporated herein by reference as if set forth in its entirety.

The present invention relates to a method for calculating the amplification factor which co-determines the volume for a speech signal transmitted in encoded form.

In the domain of speech transmission and in the field of digital signal and speech storage, the use of special digital coding methods for data compression purposes is widespread and mandatory because of the high data volume and the limited transmission capacities. A method which is particularly suitable for the transmission of speech is the Code Excited Linear Prediction (CELP) method which is known from U.S. Pat. No. 4,133,976. In this method, the speech signal is encoded and transmitted in small temporal segments (“speech frames”, “frames”, “temporal section”, “temporal segment”) having a length of about 5 ms to 50 ms each. Each of these temporal segments is not represented exactly but only by an approximation of the actual signal shape. In this context, the approximation describing the signal segment is essentially obtained from three components which are used to reconstruct the signal on the decoder side: firstly, a filter approximately describing the spectral structure of the respective signal section; secondly, a so-called “excitation signal” which is filtered by this filter; and thirdly, an amplification factor (gain) by which the excitation signal is multiplied prior to filtering. The amplification factor is responsible for the loudness of the respective segment of the reconstructed signal.

The result of this filtering then represents the approximation of the signal portion to be transmitted. The information on the filter settings and the information on the excitation signal to be used and on the scaling (gain) thereof which describes the volume must be transmitted for each segment. Generally, these parameters are obtained from different code books which are available to the encoder and to the decoder in identical copies so that only the number of the most suitable code book entries has to be transmitted for reconstruction. Thus, when coding a speech signal, these most suitable code book entries are to be determined for each segment, searching all relevant code book entries in all relevant combinations, and selecting the entries which yield the smallest deviation from the original signal in terms of a useful distance measure.

There exist different methods for optimizing the structure of the code books (for example, multiple stages, linear prediction on the basis of the preceding values, specific distance measures, optimized search methods, etc.). Moreover, there are different methods describing the structure and the search method for determining the excitation vectors.

The amplification factor (gain value) can also be determined in different ways. In principle, the amplification factor can be approximated using two methods which will be described below:

Method 1: “Waveform Matching”

In this method, the amplification factor is calculated while taking into account the waveform of the excitation signal from the code book. For the purpose of calculation, deviation E₁ between original signal x (represented as a vector), i.e., the signal to be transmitted, and the reconstructed signal gHc is minimized. In this context, g is the amplification factor to be determined, H is the matrix describing the filter operation, and c is the most suitable excitation code book vector, which is to be determined as well and has the same dimension as target vector x.

$E_{1} = \lVert x - gHc \rVert^{2}$
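For a fixed code book vector, E₁ is a quadratic function of g, so setting its derivative with respect to g to zero gives the unquantized gain g = (x·Hc)/(Hc·Hc). The following is a minimal sketch of that calculation, assuming the filtered vector Hc has already been computed; the function names and the example data are illustrative and do not come from the original text.

```python
def waveform_matching_gain(x, hc):
    """Return the gain g that minimizes ||x - g * hc||^2.

    x  : target vector (original signal segment)
    hc : filtered excitation vector H c (assumed to be computed already)
    """
    num = sum(xi * yi for xi, yi in zip(x, hc))
    den = sum(yi * yi for yi in hc)
    return num / den if den > 0.0 else 0.0


def waveform_error(x, hc, g):
    """Deviation E1 = ||x - g * hc||^2 for a given gain g."""
    return sum((xi - g * yi) ** 2 for xi, yi in zip(x, hc))


# Hypothetical example data: x is exactly twice the filtered code vector.
x = [1.0, 2.0, 3.0]
hc = [0.5, 1.0, 1.5]
g = waveform_matching_gain(x, hc)
```

In this constructed example the waveform of Hc matches x exactly, which is the situation in which method 1 works best.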

Generally, for the purpose of calculation, optimum code book vector c_opt is determined first. After that, the amplification factor g which is optimal for this vector is calculated, and then the matching code book entry g_opt is determined. This calculation yields good values whenever the waveform of the excitation code book vector, filtered with H, corresponds as far as possible to the input waveform. Generally, this is more frequently the case with clear speech without background noises than with speech signals including background noises. In the case of strong background noises, therefore, an amplification factor calculation according to method 1 can result in disturbing effects which can manifest themselves, for example, in the form of volume fluctuations.

Method 2: “Energy Matching”

In this method, amplification factor g is calculated without taking into account the waveform of the speech signal. Deviation E₂ is minimized in the calculation:

$E_{2} = \left( \lVert exc(g) \rVert - \lVert res \rVert \right)^{2}$

In this context, exc is the scaled code book vector which depends on amplification factor g; res designates the “ideal” excitation signal. Moreover, other previously determined constant code book entries d may be added:

$exc(g) = c\_opt * g + d$

This method yields good values, for example, in the case of low-periodicity signals, which may include, for example, speech signals having a high level of background noise. In the case of low background noises, however, the amplification values calculated according to method 2 are generally worse than those of method 1.
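The energy matching criterion can be sketched as follows. When no constant contribution d is added, E₂ becomes zero for g = ∥res∥/∥c_opt∥; with a non-zero d the deviation can simply be evaluated for each candidate gain. The function names and the example vectors below are assumptions made for illustration only.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(e * e for e in v))


def energy_error(g, c_opt, res, d=None):
    """Deviation E2 = (||exc(g)|| - ||res||)^2 with exc(g) = c_opt * g + d."""
    if d is None:
        d = [0.0] * len(c_opt)
    exc = [g * c + di for c, di in zip(c_opt, d)]
    return (norm(exc) - norm(res)) ** 2


def energy_matching_gain(c_opt, res):
    """Closed-form gain for the special case d = 0 (an assumption here)."""
    n = norm(c_opt)
    return norm(res) / n if n > 0.0 else 0.0


# Hypothetical example: ||c_opt|| = 5 and ||res|| = 10, so the gain is 2.
c_opt = [3.0, 4.0]
res = [6.0, 8.0]
g_energy = energy_matching_gain(c_opt, res)
```

Note that res and c_opt need not have similar shapes for E₂ to vanish; only their energies are compared.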

In the method used today, initially, optimum code book entry g_opt resulting from method 1 is determined and then amplification factor g_opt2, which is quantized, i.e., found in the code book, and which is actually to be used, is determined by minimizing quantity E₃.

$E_{3}(g\_opt2) = (1 - a) * \lVert c\_opt \rVert^{2} * (g\_opt2 - g\_opt)^{2} + a * \left( \lVert exc(g\_opt2) \rVert - \lVert res \rVert \right)^{2} \qquad \text{Equation (1)}$

In this context, weighting factor a can take values between 0 and 1 and is to be predetermined using suitable algorithms. In the extreme case that a=0, only the first summand is considered in this equation. In this case, the minimization of E₃ always leads to g_opt2=g_opt, so that value g_opt, which has previously been calculated according to method 1, is taken over as the result of the final amplification value calculation (pure “waveform matching”). In the other extreme case that a=1, however, only the second summand is considered. In this case, the same solution always results for g_opt2 as when using method 2 (pure “energy matching”). The value of a will generally be between 0 and 1 and consequently lead to a result value for g_opt2 which takes into account both method 1 (“waveform matching”) and method 2 (“energy matching”).

Thus, the degree to which the result of method 1 or the result of method 2 should be used is controlled via weighting factor a. Quantized value g_opt2, which is calculated according to equation (1) by minimizing E₃, is then transmitted and used on the decoder side.
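The minimization of E₃ over a quantized gain code book can be sketched as an exhaustive search. This is a minimal illustration under stated assumptions: the gain code book, the vectors and the function names are hypothetical, and a real coder would typically use a structured or predictive gain quantizer rather than a plain list.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(e * e for e in v))


def e3(g_q, g_opt, c_opt, res, a, d=None):
    """Weighted deviation of equation (1) for one candidate gain g_q."""
    if d is None:
        d = [0.0] * len(c_opt)
    exc = [g_q * c + di for c, di in zip(c_opt, d)]
    waveform_term = norm(c_opt) ** 2 * (g_q - g_opt) ** 2
    energy_term = (norm(exc) - norm(res)) ** 2
    return (1.0 - a) * waveform_term + a * energy_term


def select_quantized_gain(gain_codebook, g_opt, c_opt, res, a, d=None):
    """Return the code book gain that minimizes E3 (exhaustive search)."""
    return min(gain_codebook, key=lambda g_q: e3(g_q, g_opt, c_opt, res, a, d))


# Hypothetical data: waveform matching prefers 1.0, energy matching prefers 2.0.
gain_codebook = [0.5, 1.0, 1.5, 2.0, 2.5]
c_opt = [3.0, 4.0]
res = [6.0, 8.0]
g_opt = 1.0
```

With these example values, a=0 selects the waveform matching result, a=1 selects the energy matching result, and a=0.5 selects the code book entry lying between the two.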

The underlying problem now consists in determining weighting factor a for each signal segment to be encoded in such a manner that the most useful possible values are found through the calculation according to equation (1) or according to another minimization function in which a weighting between two methods is utilized. In terms of the speech quality of the transmission, “useful values” are values which are adapted as well as possible to the signal situation present in the current signal segment. For noise-free speech, for example, a would have to be selected to be near 0, whereas in the case of strong background noises, a would have to be selected to be near 1.

In the methods used today, the value of weighting factor a is controlled via a periodicity measure by using the prediction gain as the basis for the determination of the periodicity of the present signal. The value of a to be used is determined via a fixed characteristic curve f(p) from the periodicity measure data describing the current signal state, the periodicity measure being denoted by p. This characteristic curve is designed in such a manner that it yields a low value for a for highly periodic signals. This means that for highly periodic signals, preference is given to method 1 of “waveform matching”. For signals of lower periodicity, however, a higher value is selected for a, i.e., closer to 1, via f(p).

In practice, however, it has turned out that this method still results in artifacts in the case of certain signals. These include, for example, the beginning of voiced signal portions, so-called “onsets”, or also noise signals without periodic components.

SUMMARY OF THE INVENTION

Therefore, an object of the present invention is to provide a method for calculating the amplification factor which co-determines the volume for a speech signal transmitted in encoded form, which method allows an optimum weighting factor a to be determined for the calculation of an optimum amplification factor for a variety of signals.

The present invention provides a method for calculating an amplification factor for co-determining a volume for a speech signal transmitted in encoded form, the amplification factor being transmitted and used by a decoder to reconstruct the speech signal. The method includes: dividing the speech signal into a plurality of short temporal signal segments; encoding and transmitting each signal segment separately from the other signal segments; calculating the amplification factor for each signal segment by minimizing a value E(g_opt2), where

E(g_opt2)=(1−a)*f₁(g_opt2)+a*f₂(g_opt2)

a being a weighting factor; and taking into account a stationarity and a periodicity of the encoded speech signal so as to determine the weighting factor a.

In the present invention, the notation f₁ and f₂ is used to denote generic functions relating to the optimum code book vector c_opt, amplification factor g_opt2, excitation code book vector exc, and optimum code book entry g_opt. In the example described above relative to Equation (1), it can be seen that f₁(g_opt2)=∥c_opt∥² * (g_opt2 − g_opt)². Likewise, it can be seen that f₂(g_opt2)=(∥exc(g_opt2)∥ − ∥res∥)². It can be appreciated that f₁ and f₂ are functions which can be selected depending on the desired optimization of the structure of the code books, as should be apparent to those of ordinary skill in the art.

In the method according to the present invention, provision is made to not only use periodicity S₁ of the signal but to also use stationarity S₂ of the signal for determining the weighting factor. Depending on the quality of weighting factor a to be determined, it is possible for further parameters which are characteristic of the present signals, such as the continuous estimation of the noise level, to be taken into account in the determination of the weighting factor. Therefore, weighting factor a is advantageously determined not only from periodicity S₁ but from a plurality of parameters. The number of used parameters or measures will be denoted by N. An improved, more robust determination of a can be accomplished by combining the results of the individual measures. Thus, the value of a to be used is no longer made dependent on one measure only but, via a rule h, it depends on the data of all N measures S₁, S₂, . . . S_N describing the current signal state. The resulting relationship is shown in equation (2):

a=h(S₁, S₂, . . . S_N)  (equation 2)

Thus, an embodiment of the method according to the present invention uses a periodicity measure S₁ and, in addition, a stationarity measure S₂. By additionally taking into account stationarity measure S₂ of the signal, it is possible to better deal, for example, with the problematic cases (onsets, noise) mentioned above. In this context, in a speech coding system using the method according to the present invention, initially, the results of periodicity measure S₁ and of stationarity measure S₂ are calculated. Then, the suitable value for weighting factor a is calculated from the two measures according to equation (2). This value is then used in equation (1) to determine the best value for the amplification factor.

A concrete way of implementing the assignment rule h is, for example, to use a number K of different characteristic curve shapes h₁(S₁) . . . h_K(S₁) and to control, via a parameter S₂, which characteristic curve shape h_i(S₁) is to be used in the present signal case.

In this context, the following distinctions could be made for K=3:

use a=h₁(S₁), if S_2a<S₂≤S_2b,
use a=h₂(S₁), if S_2b<S₂≤S_2c,
use a=h₃(S₁), if S_2c<S₂≤S_2d,
where S_2a<S₂<S_2d

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a graphical representation of the dependence of weighting factor a on S₁; and

FIG. 2 shows a graphical representation of the relationship between weighting factor a and S₁ for the values of a₁, a_h, s1₁, and s1_h indicated.

DETAILED DESCRIPTION

In the following, the method according to the present invention will be explained in greater detail with the example that K=2. In this case, the used assignment rule h(.) provides for two different characteristic curve shapes h₁(S₁) and h₂(S₁). The respective characteristic curve is selected as a function of a further parameter S₂ which is either 0 or 1.

Parameter S₁ describes the voicedness (periodicity) of the signal. The information on the voicedness results from the knowledge of input signal s(n) (n=0 . . . L, L: length of the observed signal segment) and of the estimate τ of the pitch (duration of the fundamental period of the momentary speech segment). Initially, a voiced/unvoiced criterion is to be calculated as follows:

$\chi = \frac{\sum\limits_{i = 0}^{L - 1} s(i) \cdot s(i - \tau)}{\sqrt{\sum\limits_{i = 0}^{L - 1} s^{2}(i) \cdot \sum\limits_{i = 0}^{L - 1} s^{2}(i - \tau)}}$

The parameter S₁ used is now obtained by generating the short-term average value of χ over the last 10 signal segments (m_cur: index of the current signal segment):

$S_{1} = {\frac{1}{10}{\sum\limits_{i = {m_{cur} - 10}}^{m_{cur}}\;{\chi_{i}.}}}$
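The two formulas above can be sketched as follows. The sketch assumes that the signal buffer holds at least τ samples before the start of the observed segment and that the pitch estimate τ is supplied by a separate pitch estimator; during the first segments it averages over the values available so far instead of a full window of 10, which is a simplification not stated in the original text. All names are illustrative.

```python
import math
from collections import deque


def voicing_criterion(s, start, length, tau):
    """Normalized correlation chi between a segment and its pitch-delayed copy.

    s      : sample buffer, must contain at least tau samples before `start`
    start  : index of the first sample of the observed segment
    length : segment length L
    tau    : pitch lag estimate in samples
    """
    num = e_cur = e_lag = 0.0
    for i in range(start, start + length):
        num += s[i] * s[i - tau]
        e_cur += s[i] ** 2
        e_lag += s[i - tau] ** 2
    den = math.sqrt(e_cur * e_lag)
    return num / den if den > 0.0 else 0.0


class PeriodicityTracker:
    """Short-term average of chi over the most recent segments (S1)."""

    def __init__(self, history=10):
        self.values = deque(maxlen=history)

    def update(self, chi):
        self.values.append(chi)
        return sum(self.values) / len(self.values)


# Hypothetical test signal: a sine wave with a period of 8 samples.
signal = [math.sin(2.0 * math.pi * i / 8.0) for i in range(32)]
chi_matched = voicing_criterion(signal, start=8, length=16, tau=8)
chi_half = voicing_criterion(signal, start=8, length=16, tau=4)
```

For the strictly periodic test signal, a lag equal to the period gives χ close to 1, while a lag of half a period gives χ close to −1.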

FIG. 1 is a schematic representation of the dependence of weighting factor a on S₁.

Accordingly, the shape of the characteristic curve depends on the selection of threshold values a₁ and a_h as well as s1₁ and s1_h.

The indicated selection of characteristic curve h₁ or h₂ as a function of S₂ means that different combinations of threshold values (a₁, a_h, s1₁, s1_h) are selected for different values of S₂.
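A characteristic curve controlled by four threshold values can be sketched as shown below. Because the figure itself is not reproduced here, the piecewise-linear shape is an assumption: the curve is taken to hold the high value of a below the lower periodicity threshold, the low value of a above the upper periodicity threshold, and to interpolate linearly in between, which agrees with the requirement that highly periodic signals receive a low a. The parameter names and the numeric thresholds are hypothetical.

```python
def characteristic_curve(s1, a_low, a_high, s1_low, s1_high):
    """Map periodicity s1 to weighting factor a (assumed piecewise-linear shape).

    a_high is returned for weakly periodic signals (s1 <= s1_low),
    a_low for strongly periodic signals (s1 >= s1_high).
    """
    if s1 <= s1_low:
        return a_high
    if s1 >= s1_high:
        return a_low
    t = (s1 - s1_low) / (s1_high - s1_low)
    return a_high + t * (a_low - a_high)


# Hypothetical threshold combination for one value of S2.
thresholds = dict(a_low=0.2, a_high=0.8, s1_low=0.3, s1_high=0.7)
a_noise_like = characteristic_curve(0.0, **thresholds)
a_voiced = characteristic_curve(1.0, **thresholds)
a_between = characteristic_curve(0.5, **thresholds)
```

Selecting a different curve for a different value of S₂ then amounts to passing a different set of the four thresholds.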

Parameter S₂ contains information on the stationarity of the present signal segment. Specifically, this is status information which indicates whether speech activity (S₂=1) or a speech pause (S₂=0) is present in the signal segment currently observed. This information must be supplied by an algorithm for detecting speech pauses (VAD=Voice Activity Detection).

Since the recognition of speech pauses and the recognition of stationary signal segments are in principle similar, the VAD is not optimized for an exact determination of the speech pauses (as is otherwise usual) but for a classification of signal segments that are considered to be stationary with regard to the determination of the amplification factor.

Since stationarity S₂ of a signal is not a clearly defined measurable variable, it will be defined more precisely below.

If, initially, the frequency spectrum of a signal segment is looked at, it has a characteristic shape for the observed period of time. If the change in the frequency spectra of temporally successive signal segments is sufficiently low, i.e., the characteristic shapes of the respective spectra are more or less maintained, then one can speak of spectral stationarity.

If a signal segment is observed in the time domain, then it has an amplitude or energy profile which is characteristic of the observed period of time. If the energy of temporally successive signal segments remains constant or if the deviation of the energy is limited to a sufficiently small tolerance interval, then one can speak of temporal stationarity.

If temporally successive signal segments are both spectrally and temporally stationary, then they are generally described as stationary. The determination of spectral and temporal stationarity is carried out in two separate stages. Initially, the spectral stationarity is analyzed:

Spectral Stationarity (Stage 1)

To determine whether spectral stationarity exists, initially, a spectral distance measure, the so-called “spectral distortion” SD, of successive signal segments is observed.

The resulting calculation is as follows:

$SD = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left( 10\log\left\lbrack \frac{1}{\left| A(e^{j\omega}) \right|^{2}} \right\rbrack - 10\log\left\lbrack \frac{1}{\left| A^{\prime}(e^{j\omega}) \right|^{2}} \right\rbrack \right)^{2} d\omega}$

In this context,

$10\log\left\lbrack \frac{1}{\left| A(e^{j\omega}) \right|^{2}} \right\rbrack$

denotes the logarithmized frequency response envelope of the current signal segment, and

$10\log\left\lbrack \frac{1}{\left| A^{\prime}(e^{j\omega}) \right|^{2}} \right\rbrack$

denotes the logarithmized frequency response envelope of the preceding signal segment. To make the decision, both SD itself and its short-term average value $\overline{SD}$ over the last 10 signal segments are looked at. If both measures SD and $\overline{SD}$ are below a threshold value SD_g and $\overline{SD}_g$, respectively, which are specific for them, then spectral stationarity is assumed. Specifically, it applies that:

-   SD_g = 2.6 dB
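The spectral distortion can be approximated numerically by replacing the integral with an average over a uniform frequency grid. The sketch below assumes that A is given by its polynomial coefficients (A(e^{jω}) = Σ a_k e^{−jωk}, as is usual for linear prediction filters), that the logarithm is to base 10 so that SD is in dB, and that A has no zeros on the unit circle. The function names, the grid size and the way the history of SD values is passed in are illustrative choices, and the threshold for the averaged measure must be supplied by the caller.

```python
import cmath
import math


def log_envelope_db(a_coeffs, omega):
    """10*log10(1/|A(e^{j*omega})|^2) for A given by its coefficients."""
    a_val = sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(a_coeffs))
    return 10.0 * math.log10(1.0 / abs(a_val) ** 2)


def spectral_distortion(a_cur, a_prev, n_points=512):
    """Approximate SD in dB by averaging over n_points frequencies in [-pi, pi)."""
    total = 0.0
    for m in range(n_points):
        omega = -math.pi + 2.0 * math.pi * m / n_points
        diff = log_envelope_db(a_cur, omega) - log_envelope_db(a_prev, omega)
        total += diff * diff
    return math.sqrt(total / n_points)


def is_spectrally_stationary(sd, recent_sd, sd_threshold, avg_threshold):
    """Both the current SD and its short-term average must be below threshold."""
    avg = sum(recent_sd) / len(recent_sd)
    return sd < sd_threshold and avg < avg_threshold


# Hypothetical first-order filters: identical shape, second one scaled by 2.
sd_same = spectral_distortion([1.0, -0.9], [1.0, -0.9])
sd_scaled = spectral_distortion([1.0, -0.9], [2.0, -1.8])
```

Scaling all coefficients by 2 shifts the log envelope by a constant 10·log10(4) dB at every frequency, so the distortion equals that constant in the second example.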

It is problematic that extremely periodic (voiced) signal segments feature this spectral stationarity as well. They are excluded via periodicity measure s1. It applies that:

-   If s1 ≥ 0.7 or s1 < 0.3, the observed signal segment is assumed not to be spectrally stationary.

Temporal Stationarity (Stage 2)

The determination of temporal stationarity takes place in a second stage whose decision thresholds depend on the detection of spectrally stationary signal segments of the first stage. If the present signal segment has been classified as spectrally stationary by the first stage, then its frequency response envelope

$\frac{1}{\left| A(e^{j\omega}) \right|^{2}}$

is stored. Also stored is reference energy E_reference of residual signal d_reference which results from the filtering of the present signal segment with a filter having the frequency response |A(e^(jω))|² which is inverse to this signal segment. E_reference results from

$E_{reference} = \sum\limits_{n = 0}^{L - 1} d_{reference}^{2}(n)$

where L corresponds to the length of the observed signal segment.

This energy serves as a reference value until the next spectrally stationary segment is detected. All subsequent signal segments are now filtered with the same stored filter. Now, energy E_rest of residual signal d_rest which has resulted after the filtering is measured. Accordingly, it is expressed as:

$E_{rest} = \sum\limits_{n = 0}^{L - 1} d_{rest}^{2}(n)$

The final decision of whether the observed signal segment is stationary follows the following rule:

-   If E_rest < E_reference + tolerance, then s2=1, signal stationary,
-   otherwise s2=0, signal non-stationary.
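The second stage can be sketched as a small stateful detector that stores the filter and reference energy of the most recent spectrally stationary segment and compares the residual energy of each later segment against it. The sketch makes several assumptions that are not in the original text: the inverse filter is applied as an FIR filter with zero initial state for every segment, the tolerance is a caller-supplied constant, and the detector reports "not stationary" until a first reference has been stored. It returns the stationarity decision as a boolean and leaves the mapping to a value of s2 to the caller. All names are illustrative.

```python
def residual_energy(segment, a_coeffs):
    """Energy of the residual d(n) = sum_k a_k * s(n - k), zero initial state."""
    energy = 0.0
    for n in range(len(segment)):
        d = sum(a_coeffs[k] * segment[n - k]
                for k in range(len(a_coeffs)) if n - k >= 0)
        energy += d * d
    return energy


class TemporalStationarityDetector:
    """Compares residual energy against a stored reference energy."""

    def __init__(self, tolerance):
        self.tolerance = tolerance  # hypothetical value, to be tuned
        self.ref_filter = None
        self.ref_energy = None

    def update(self, segment, a_coeffs, spectrally_stationary):
        # A spectrally stationary segment refreshes the stored reference.
        if spectrally_stationary:
            self.ref_filter = list(a_coeffs)
            self.ref_energy = residual_energy(segment, self.ref_filter)
        if self.ref_filter is None:
            return False  # no reference yet (assumed behaviour)
        e_rest = residual_energy(segment, self.ref_filter)
        return e_rest < self.ref_energy + self.tolerance


# Hypothetical segments: a decaying sequence and the same sequence doubled.
quiet = [1.0, 0.9, 0.81, 0.729]
loud = [2.0 * v for v in quiet]
detector = TemporalStationarityDetector(tolerance=0.5)
```

Because the stored filter stays in use until the next spectrally stationary segment arrives, a later segment with the same spectral shape but higher level yields a larger residual energy and fails the comparison.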

By way of example, the assignment depicted in FIG. 2 applies in this context, where for

-   s2=1 (h1(s1), non-stationary): and
-   s2=0 (h2(s1), stationary/pause) → a=1.0 for all s1

This means that the characteristic curve is flat and that a has the value 1, independently of s1.
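The K=2 assignment described above can be sketched as a simple switch between two curves. The flat curve for s2=0 follows the text; the shape of h1 used in the example is purely hypothetical, since the values of FIG. 2 are not reproduced here.

```python
def weighting_factor(s1, s2, h1):
    """Select weighting factor a for K=2.

    s2 == 0 (stationary segment / speech pause): flat curve, a = 1.0.
    s2 == 1 (non-stationary segment): a follows characteristic curve h1(s1).
    """
    if s2 == 0:
        return 1.0
    return h1(s1)


def example_h1(s1):
    """Hypothetical curve: a falls linearly from 1 to 0 as periodicity rises."""
    return min(1.0, max(0.0, 1.0 - s1))


a_pause = weighting_factor(0.9, 0, example_h1)
a_speech = weighting_factor(0.25, 1, example_h1)
```

During a detected pause the result is pure energy matching regardless of the periodicity measure, which is the behaviour the flat curve is meant to enforce.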

It is, of course, also possible to conceive of a dependency in which a continuous parameter s2 (0 ≤ s2 ≤ 1) contains information on the stationarity. In this case, the different characteristic curves h₁ and h₂ are replaced with a three-dimensional area h(s1, s2) which determines a.
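One conceivable way to build such an area is to blend linearly between the two curves of the K=2 example, so that the end points s2=0 and s2=1 reproduce the binary case. This is only one assumption among many possible surfaces; the blending rule and the curve h1 below are hypothetical and not taken from the original text.

```python
def weighting_surface(s1, s2, h1):
    """Hypothetical surface h(s1, s2): linear blend between a = 1.0 and h1(s1).

    s2 = 0 gives the flat curve a = 1.0, s2 = 1 gives h1(s1).
    """
    s2 = min(1.0, max(0.0, s2))  # clamp to the permitted range
    return (1.0 - s2) * 1.0 + s2 * h1(s1)


def example_h1(s1):
    """Hypothetical curve: a falls linearly from 1 to 0 as periodicity rises."""
    return min(1.0, max(0.0, 1.0 - s1))


a_mid = weighting_surface(0.5, 0.5, example_h1)
```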

Of course, the algorithms for determining the stationarity and the periodicity can be adapted to the specific given circumstances accordingly. The individual threshold values and functions mentioned above are exemplary and may be found by separate trials.

1. A method for calculating an amplification factor for co-determining a volume for a speech signal transmitted in encoded form, the amplification factor being transmitted and used by a decoder to reconstruct the speech signal, the method comprising the steps of: dividing the speech signal into a plurality of short temporal signal segments; encoding and transmitting each signal segment separately from the other signal segments; calculating the amplification factor for each signal segment by minimizing a deviation value E(g_opt2), wherein E(g_opt2)=(1−a)*f₁(g_opt2)+a*f₂(g_opt2), wherein g_opt2 is an amplification factor, f₁ represents waveform matching, f₂ represents energy matching, and a is a weighting factor; and taking into account a stationarity and a periodicity of the encoded speech signal so as to determine the weighting factor a.
2. The method as recited in claim 1 wherein the minimizing of the value E(g_opt2) is performed using the equation: E₃(g_opt2) = (1 − a) * ∥c_opt∥² * (g_opt2 − g_opt)² + a * (∥exc(g_opt2)∥ − ∥res∥)², wherein c_opt is an optimum codebook vector, g_opt is an optimum codebook entry, exc is a scaled codebook vector, and res is an ideal excitation signal.
3. The method as recited in claim 1 wherein the step of taking into account a stationarity and a periodicity of the encoded speech signal is performed by selecting a function h₁(S₁) as a function of a value determined for the stationarity of the encoded speech signal, S₁ being a measure of the periodicity of the encoded speech signal.
4. The method as recited in claim 3 wherein the stationarity is a measure of speech activity.
5. The method as recited in claim 3 wherein the stationarity is a measure of a ratio of speech level to background noise level of a respective signal segment.
6. The method as recited in claim 1 further comprising the step of calculating the stationarity as a function of a spectral change and an energy change.
7. The method as recited in claim 6 wherein the energy change is a measure of temporal stationarity.
8. The method as recited in claim 6 wherein the step of calculating the stationarity is performed by taking into account at least one temporally preceding signal segment.
9. The method as recited in claim 8 further comprising the step of determining the energy change as a function of the spectral change.
10. A method for determining a weighting factor to be applied in a calculation of an amplification factor for co-determining a volume for a speech signal transmitted in encoded form, the method comprising the steps of: dividing the speech signal into a plurality of temporal signal segments; encoding and transmitting each signal segment separately from the other signal segments; calculating the weighting factor a based on a stationarity and a periodicity of the encoded speech signal; and calculating the amplification factor for each signal segment by minimizing a deviation between an original signal and a reconstructed signal in accordance with the weighting factor a.
11. The method according to claim 10, wherein the step of calculating the weighting factor a comprises the step of: calculating the periodicity based on the length of a respective temporal signal segment and an estimate of a pitch of the respective temporal signal segment.
12. The method according to claim 11, wherein the step of calculating the periodicity further comprises the step of: calculating a voiced/unvoiced criterion based on the length of a respective temporal signal segment and an estimate of a pitch of the respective temporal signal segment; and generating a short-term average value of the temporal signal segments.
13. The method according to claim 10, wherein the step of calculating the weighting factor a comprises the step of: calculating the stationarity of a respective signal segment based on a spectral stationarity and a temporal stationarity of the respective signal segment.
14. The method according to claim 13, wherein the step of calculating the stationarity of a respective signal segment comprises the steps of: determining the spectral distortion of a respective signal segment; calculating a short-term average value of the spectral distortion over a series of preceding segments; and evaluating if both the spectral distortion of the respective signal segment and the short-term average value of the spectral distortion are below a threshold value to determine spectral stationarity.
15. The method according to claim 13, wherein the step of calculating the weighting factor a comprises the step of: calculating a temporal stationarity of the respective signal segment if the respective signal segment is determined to be spectrally stationary.
16. The method according to claim 15, wherein the step of calculating a temporal stationarity of the respective signal segment comprises the steps of: storing a frequency response envelope of the respective signal segment; filtering the respective signal segment with a filter having an inverse frequency response to that of the respective signal segment; calculating a reference energy of the respective signal segment; storing the reference energy of the respective signal segment; filtering subsequent signal segments to determine an energy of a residual signal; and determining if the respective signal segment is stationary based upon whether the residual signal energy is greater than the reference energy.
17. The method according to claim 10, further comprising the step of: selecting a respective characteristic curve as a function of the stationarity and the periodicity of the encoded speech signal.